Validity in Early Theory
Validity in testing and assessment has traditionally been understood to mean discovering whether a test measures accurately what it is intended to measure (Hughes, 1989). Henning (1987) defines validity as follows: “Validity in general refers to the appropriateness of a given test or any of its component parts as a measure of what it is purported to measure.” Test validity refers to the degree to which the inferences based on test scores are meaningful, useful, and appropriate. Thus test validity is a characteristic of a test when it is administered to a particular population. Validating a test refers to accumulating empirical data and logical arguments to show that the inferences are indeed appropriate (Goodenough, 1950).

Three Types of Validity in Early Theory

In the early days of validity investigation, validity was broken into three types that were typically seen as distinct. Each type of validity was related to the kind of evidence that would count towards demonstrating that a test was valid. Cronbach and Meehl (1955) described these as:

· Criterion-oriented validity
  · Predictive validity
  · Concurrent validity
· Content validity
· Construct validity

We will introduce each of these in turn, and then show how this early approach has changed (Fulcher and Davidson, 2007).

According to Brown (2004), there are five types of validity evidence:

· Content-Related Evidence
· Criterion-Related Evidence
· Construct-Related Evidence
· Consequential Validity
· Face Validity

On the other hand, Bachman (1990: 236), in ''Fundamental Considerations in Language Testing'', cites Messick (1989), who describes validity as ‘an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores’. It has been traditional to classify validity into different types such as content, criterion, and construct validity.
So, in comparison with Brown's classification, consequential validity and face validity are not considered here.

1. Criterion-oriented validity

When considering criterion-oriented validity, the tester is interested in the relationship between a particular test and a criterion about which we wish to make predictions. According to Messick (1989, 1996a, 1996b), criterion-related validity evidence seeks to demonstrate that test scores are systematically related to one or more outcome criteria. In terms of an achievement test, for example, criterion-related validity may refer to the extent to which a test can be used to draw inferences regarding achievement. Empirical evidence in support of criterion-related validity may include a comparison of performance on the test against performance on outside criteria such as grades, class rank, other tests and teacher ratings.

Predictive validity is the term used when the test scores are used to predict some future criterion, such as academic success. If the scores are used to predict a criterion at the same time the test is given, we are studying concurrent validity.

2. Content validity

Content validity is defined as any attempt to show that the content of the test is a representative sample from the domain that is to be tested (Fulcher and Davidson, 2007). According to Messick (1989, 1996a, 1996b), content-related validity evidence refers to the extent to which the test questions represent the skills in the specified subject area. Content validity is often evaluated by examining the plan and procedures used in test construction.

Carroll (1980: 67) argued that achieving content validity in testing English for Academic Purposes (EAP) consisted of describing the test takers, analyzing their “communicative needs” and specifying test content on the basis of their needs.
In early approaches to communicative language testing, the central issue in establishing content validity was how best to sample from needs and the target domain (Fulcher, 1999a: 222-223).

3. Construct validity

In the early history of validity theory there was an assumption that there is such a thing as a “psychologically real construct” that has an independent existence in the test taker, and that the test scores represent the degree of presence or absence of this very real property. As Cronbach and Meehl (1955: 284) put it: “Construct validation takes place when an investigator believes that his instrument reflects a particular construct, to which are attached certain meanings. The proposed interpretation generates specific testable hypotheses, which are a means of confirming or disconfirming the claim.”

This brings us to our first philosophical observation. It has frequently been argued that early validity theorists were positivistic in their outlook.

According to Messick (1989, 1996a, 1996b), construct-related validity evidence refers to the extent to which the test measures the "right" psychological constructs. Intelligence, self-esteem and creativity are examples of such psychological traits. Evidence in support of construct-related validity can take many forms. One approach is to demonstrate that the items within a measure are inter-related and therefore measure a single construct. Inter-item correlation and factor analysis are often used to demonstrate relationships among the items. Another approach is to demonstrate that the test behaves as one would expect a measure of the construct to behave. For example, one might expect a measure of creativity to show a greater correlation with a measure of artistic ability than with a measure of scholastic achievement.

Construct validity is not to be identified solely by particular investigative procedures, but by the orientation of the investigator.
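The two correlational approaches just described (inter-item correlation, and comparing a measure's correlation with a related measure against its correlation with an unrelated one) can be illustrated with a short sketch. This is only an illustration under stated assumptions: all scores are invented, the function names `pearson` and `mean_inter_item_correlation` are hypothetical helpers written for this example, and a real validation study would use far larger samples and, typically, factor analysis as well.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when scores do not vary")
    return cov / sqrt(var_x * var_y)


def mean_inter_item_correlation(items):
    """Average Pearson correlation over every pair of items.

    `items` is a list of items; each item is a list of scores,
    one score per test taker, in the same test-taker order.
    """
    pairs = [(i, j) for i in range(len(items)) for j in range(i + 1, len(items))]
    if not pairs:
        raise ValueError("need at least two items")
    return sum(pearson(items[i], items[j]) for i, j in pairs) / len(pairs)


# Hypothetical scores for six test takers (invented for illustration only).
creativity = [12, 15, 9, 20, 17, 11]
artistic_ability = [30, 34, 25, 41, 37, 27]
scholastic_achievement = [55, 71, 62, 58, 66, 74]

# Hypothetical item-level scores on a three-item creativity measure.
creativity_items = [
    [4, 5, 3, 7, 6, 4],
    [4, 5, 2, 6, 6, 3],
    [4, 5, 4, 7, 5, 4],
]

convergent = pearson(creativity, artistic_ability)
discriminant = pearson(creativity, scholastic_achievement)

print("mean inter-item correlation:", round(mean_inter_item_correlation(creativity_items), 2))
print("creativity vs artistic ability:", round(convergent, 2))
print("creativity vs scholastic achievement:", round(discriminant, 2))
print("pattern matches expectation:", convergent > discriminant)
```

With these invented numbers the creativity measure correlates more strongly with artistic ability than with scholastic achievement, which is the pattern the text says one would expect. A pattern of this kind is one piece of supporting evidence, not proof that the construct has been measured.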
Criterion-oriented validity, as Bechtoldt emphasizes (Bechtoldt, 1951), "involves the acceptance of a set of operations as an adequate definition of whatever is to be measured." When an investigator believes that no criterion available to him is fully valid, he perforce becomes interested in construct validity because this is the only way to avoid the "infinite frustration" of relating every criterion to some more ultimate standard. In content validation, acceptance of the universe of content as defining the variable to be measured is essential. Construct validity must be investigated whenever no criterion or universe of content is accepted as entirely adequate to define the quality to be measured.

Determining what psychological constructs account for test performance is desirable for almost any test. Thus, although the MMPI was originally established on the basis of empirical discrimination between patient groups and so-called normals (concurrent validity), continuing research has tried to provide a basis for describing the personality associated with each score pattern. Such interpretations permit the clinician to predict performance with respect to criteria which have not yet been employed in empirical validation studies (Meehl, 1954, pp. 49-50, 110-111).

We can distinguish among the four types of validity by noting that each involves a different emphasis on the criterion. In predictive or concurrent validity, the criterion behavior is of concern to the tester, and he may have no concern whatsoever with the type of behavior exhibited in the test. (An employer does not care if a worker can manipulate blocks, but the score on the block test may predict something he cares about.) Content validity is studied when the tester is concerned with the type of behavior involved in the test performance. Indeed, if the test is a work sample, the behavior represented in the test may be an end in itself.
Construct validity is ordinarily studied when the tester has no definite criterion measure of the quality with which he is concerned, and must use indirect measures. Here the trait or quality underlying the test is of central importance, rather than either the test behavior or the scores on the criteria (Meehl, 1954).

Construct validity and truth

In the early history of validity theory there was an assumption that there is such a thing as a ‘psychologically real construct’ that has an independent existence in the test taker, and that the test scores represent the degree of presence or absence of this very real property. As Cronbach and Meehl (1955: 284) put it:

Construct validation takes place when an investigator believes that his instrument reflects a particular construct, to which are attached certain meanings. The proposed interpretation generates specific testable hypotheses, which are a means of confirming or disconfirming the claim.

They assumed that their constructs actually existed in the heads of the test takers. For Cronbach and Meehl (1955: 248), to ‘make clear what something is’ means to set forth the laws in which it occurs. They refer to the interlocking system of laws which constitute a theory as a nomological network.

The idea of a nomological network is not difficult to grasp. Firstly, it contains a number of constructs, and their names are abstract, like those in the list above. In language teaching and testing, ‘fluency’ and ‘accuracy’ are two well-known constructs. Secondly, the nomological network contains the observable variables: those things that we can see and measure directly, whereas we cannot see ‘fluency’ and ‘accuracy’ directly. In testing and assessment this meant that if there is no possible way to test the hypotheses created by the relationships between observable variables, between observable variables and constructs, and between constructs, the theory is meaningless, or not ‘scientifically admissible’.
Construct definition lies at the center of testing and assessment; any validity study is an investigation of the intended meaning and interpretation of test scores. As Messick (1989: 26) puts it (using the term ‘instrumentalist’ for ‘pragmatist’): ‘According to the instrumentalist theory of truth, a statement is true if it is useful in directing inquiry or guiding action.’ Writing from a post-positivistic era, Messick (1989: 23) also added:

Nomological networks are viewed as an illuminating way of speaking systematically about the role of constructs in psychological theory and measurement, but not as the only way. The nomological framework offers a useful guide for disciplined thinking about the process of validation but cannot serve as the prescriptive validation model to the exclusion of other approaches.

Peirce believed that one day, at some point so far into the future that no one can see it, all researchers would come to a ‘final conclusion’ that is the truth, and to which our present truths approximate. Validity theory occupies an uncomfortable philosophical space in which the relationship between theory and evidence is sometimes unclear and messy, because theory is always evolving, and new evidence is continually collected.

Cutting the validity cake

Since Cronbach and Meehl, the study of validity has become one of the central enterprises in psychological, educational and language testing. Messick (1989: 20) wrote:

Traditional ways of cutting and combining evidence of validity, as we have seen, have led to three major categories of evidence: content-related, criterion-related, and construct-related. However, because content- and criterion-related evidence contributes to score meaning, they have come to be recognized as aspects of construct validity. In a sense, then, this leaves only one category, namely, construct-related evidence.
Messick set out to produce a ‘unified validity framework’, in which different types of evidence contribute in their own way to our understanding of construct validity. Messick fundamentally changed the way in which we understand validity. He described validity as:

An integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment. (Messick, 1989: 13)

Therefore, ‘validity’ is not a property of a test or assessment but the degree to which we are justified in making an inference to a construct from a test score. Messick’s way of looking at validity has become the accepted paradigm in psychological, educational and language testing. This can be seen in the evolution of the Standards for Educational and Psychological Testing. In the Technical Recommendations (APA, 1954) the ‘four types’ of validity were described, and by 1966 these had become the ‘three types’ of content, criterion and construct validity. The 1974 edition kept the same categorization, but claimed that they were closely related. In 1985 the categories were abandoned and the unitary interpretation became explicit:

Validity is the most important consideration in test evaluation. The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores. Test validation is the process of accumulating evidence to support such inferences. A variety of inferences may be made from scores produced by a given test, and there are many ways of accumulating evidence to support any particular inference. Validity, however, is a unitary concept. Although evidence may be accumulated in many ways, validity always refers to the degree to which that evidence supports the inferences that are made from the score. The inferences regarding specific uses of a test are validated, not the test itself.
(AERA et al., 1985: 9)

Test usefulness

Bachman and Palmer (1996: 18) have used the term ‘usefulness’ as a superordinate in place of construct validity. Usefulness includes:

· Reliability, that is, the consistency of test scores across facets of the test.
· Construct validity.
· Authenticity, defined as the relationship between test task characteristics and the characteristics of tasks in the real world.
· Interactiveness, the degree to which the individual test taker’s characteristics (language ability, background knowledge and motivations) are engaged when taking a test.
· Practicality, which is concerned with test implementation rather than the meaning of test scores.

The validity cline

Chapelle has characterized three current approaches to validity, as follows:

1. Trait theory: a ‘trait’ here is no different from the notion of a ‘construct’. It is assumed that the construct to be tested is an attribute of the test taker. The test taker’s knowledge and processes are assumed to be stable and real, and the test is designed to measure these. Score meaning is therefore established on the basis of correspondence between the score and the actuality of the construct in the test taker.

2. Behaviorist approach: the test score is mostly affected by context, such as physical setting, topic and participants. These are typically called ‘facets’ in the language testing literature. In ‘real world’ communication there is always a context: a place where the communication typically takes place, a subject, and people who talk. The behaviorist approach is typified in the work of Tarone (1998), in which it is argued that performance on test tasks varies (within individuals) by task and by features or facets of the task. She argues that the idea of a ‘stable competence’ is untenable, and that ‘variable capability’ is the only defensible position. In other words, there are no constructs that really exist within individuals.
Rather, our abilities are variable, and change from one situation to another. Fulcher (1995) and Fulcher and Márquez Reiter (2003) have shown that, in a behaviourist approach, each test would be a test of performance in the specific situation defined in the facets of the test situation. ‘Validity’ would be the degree to which it could be shown that there is a correspondence between the real-world facets and the test facets, and score meaning could only be generalized to corresponding real-world tasks. These two theories are therefore very different in how they understand score meaning, and the difference can be understood in terms of the concept of ‘generalizability’.

3. Pragmatic approach: in language testing there is no such thing as an ‘absolute’ answer to the validity question. The role of the language tester is to collect evidence to support test use and interpretation that a larger community, the stakeholders (students, testers, teachers and society), accepts. But this truth may change as new evidence comes to light. As James (1907: 88) put it, ‘truth ''happens'' to an idea’ through a process, and ‘its validity is the process of its valid-''ation''’ (italics in the original).

Peirce (undated: 4–5) has suggested that the kinds of arguments we construct in language testing may be evaluated through abduction, or what he later called retroduction. He explains that retroduction is:

the process in which the mind goes over all the facts of the case, absorbs them, digests them, sleeps over them, assimilates them, dreams of them, and finally is prompted to deliver them in a form, which, if it adds something to them, does so only because the addition serves to render intelligible what without it, is unintelligible.
I have hitherto called this kind of reasoning which issues in explanatory hypotheses and the like, abduction, because I see reason to think that this is what Aristotle intended to denote by the corresponding Greek term ‘apagoge’ in the 25th chapter of the 2nd Book of his Analytics. But since this, after all, is only conjectural, I have on reflexion decided to give this kind of reasoning the name of retroduction to imply that it turns back and leads from the consequent of an admitted consequence, to its antecedent. Observe, if you please, the difference of meaning between a consequent, the thing led to, and a consequence, the general fact by virtue of which a given antecedent leads to a certain consequent.

In language testing, the method of validation is the same: it involves the successful elimination of alternative explanations of the facts. For validity investigation, a number of criteria have been established by which we might decide which is the most satisfying explanation of the facts:

· Simplicity, otherwise known as Ockham’s Razor, which states: ‘Pluralitas non est ponenda sine necessitate’, translated as: ‘Do not multiply entities unnecessarily.’ In practice this means that the least complicated explanation of the facts is to be preferred: the argument that needs the fewest causal links, makes the fewest claims about things existing that we cannot investigate directly, and does not require us to speculate well beyond the evidence available.
· Coherence, or the principle that we prefer an argument that is more in keeping with what we already know.
· Testability, so that the preferred argument would allow us to make predictions about future actions, behaviour, or relationships between variables, that we could investigate.
· Comprehensiveness, which urges us to prefer the argument that takes account of the most facts and leaves as little unexplained as possible.

REFERENCES

Bachman, L.F. (1990) Fundamental Considerations in Language Testing.
Oxford: Oxford University Press.

Bechtoldt, H.P. (1951) Selection. In S.S. Stevens (Ed.), Handbook of experimental psychology (pp. 1237-1267). New York: Wiley.

Brown, H.D. (2004) Language Assessment: Principles and Classroom Practices. Longman.

Child, I.L. (1954) Personality. Annu. Rev. Psychol., 5, 149-171.

Farhady, H., Jafarpur, A. and Birjandi, P. (2004) Testing Language Skills: From Theory to Practice. The Center for Studying and Compiling University Books in Humanities (SAMT).

Fulcher, G. and Davidson, F. (2007) Language Testing and Assessment: An advanced resource book. London and New York: Routledge, Taylor and Francis Group.

Goodenough, F.L. (1950) Mental testing. New York: Rinehart.

Meehl, P.E. (1954) Clinical vs. statistical prediction. Minneapolis: University of Minnesota Press.

Messick, S. (1989) Validity. In R.L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.

Messick, S. (1996a) Standards-based score interpretation: Establishing valid grounds for valid inferences. Proceedings of the joint conference on standard setting for large scale assessments, sponsored by the National Assessment Governing Board and the National Center for Education Statistics. Washington, DC: Government Printing Office.

Messick, S. (1996b) Validity of performance assessment. In G. Philips (Ed.), Technical Issues in Large-Scale Performance Assessment. Washington, DC: National Center for Educational Statistics.